Abstract. Dialogism is a philosophical theory centered on the idea that
life involves a dialogue among multiple voices in a continuous exchange 
and interaction. Considering human language, different ideas or points 
of view take the form of voices, which spread throughout any discourse 
and influence it. From a computational point of view, voices can be
operationalized as semantic chains that contain related words. This study
introduces and evaluates a novel method of identifying semantic chains
using BERT, a state-of-the-art language model for computational
linguistics. The resulting model generalizes to multiple relations including
repetitions, semantically related concepts from WordNet (i.e., synonyms, 
hypernyms, hyponyms, and siblings), as well as pronominal resolutions. 
By combining the attention scores between words, word pairs are merged
into connected components that denote emerging voices from the
discourse. The introduced visualization suggests that this method captures
the semantic links between words, and even compound words, more densely
than classical methods of building lexical chains.
1 Introduction


Dialogism is a philosophical theory introduced by Mikhail Bakhtin [1,2], centered 
on the idea that everything, even life, is dialogic, a continual exchange and 
interaction between voices: “Life by its very nature is dialogic... when dialogue 
ends, everything ends” [2]. Trausan-Matu et al. [3] extended the concept of 
voice for analyzing discourse, in general, and collaborative learning, in particular. 
They consider voices to be generalized representations of different points of view 
or ideas, which spread throughout the discourse, and influence it. Voices were 
subsequently operationalized by Dascalu et al. [4] as semantic chains that were 
obtained by combining lexical chains, i.e., sequences of repeated or related words,
including synonyms or hypernyms [5]. Semantic chains propagate along sentences 
and help create narrative threads throughout the text. 

Recent studies on building lexical chains consider word repetitions,
synonyms, and semantic relationships between nouns [6]. Mukherjee et al. [6] used
lexical chains to distinguish easy from difficult medical texts. Identifying lexical
chains that signal a difficult sentence helps in the simplification process. Olena
[7] proposed a method for identifying lexical chains based on graphs, in which
the nodes represent the terms in the document, and the edges the semantic
relations between them. More recently, Ruas et al. [8] combined lexical chains with
word embeddings to classify documents.

We introduce and evaluate a novel operationalization of voices using
BERT [9], a state-of-the-art language model. This model further enhances
the Cohesion Network Analysis graph from the ReaderBench framework [10,11]
by integrating semantic links between related concepts, indicative of semantic flow [12].


2 Method 


A dataset with examples of links was required to identify the attention
heads from BERT capable of detecting semantic links between words that belong
to the same chain. A set of simple heuristics was used to extract links from
sample texts, for all pairs of words tagged as noun, verb, or pronoun that fulfil
one of the following conditions: a) repetitions of words having the same lemma;
b) synonyms, hypernyms, or siblings in the WordNet taxonomy [13]; and c)
coreferences identified using spaCy. The TASA corpus was selected as reference
due to its diversity and the range of complexity levels it covers. The “correct”
pairs were extracted from the entire dataset using the previous rules, while the
“incorrect” ones were randomly sampled with 10% probability from all pairs of
words that were not selected; without this subsampling, the negative samples
would have been one order of magnitude more numerous than the “correct”
semantic associations. In total, 49 million word pairs were extracted, out of
which around 20 million were positive examples.
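
The pair-extraction procedure above can be illustrated with a minimal Python
sketch. It is not the authors' implementation: the function names, the token
format, and the `is_linked` predicate are hypothetical, and only heuristic a)
(same lemma) is implemented, since the WordNet and coreference rules depend on
external resources.

```python
import random
from itertools import combinations

CANDIDATE_TAGS = {"NOUN", "VERB", "PRON"}  # the three tags named in the text


def build_pair_dataset(tokens, is_linked, negative_rate=0.10, seed=0):
    """Keep every linked pair as a positive and subsample the remaining pairs.

    tokens: list of (index, lemma, pos) tuples (hypothetical format).
    is_linked: predicate standing in for the heuristics described in the text.
    """
    rng = random.Random(seed)
    candidates = [t for t in tokens if t[2] in CANDIDATE_TAGS]
    dataset = []
    for a, b in combinations(candidates, 2):
        if is_linked(a, b):
            dataset.append((a[0], b[0], 1))  # "correct" pair
        elif rng.random() < negative_rate:
            dataset.append((a[0], b[0], 0))  # sampled "incorrect" pair
    return dataset


def same_lemma(a, b):
    # Heuristic a) only: repetition of the same lemma.
    return a[1] == b[1]


tokens = [
    (0, "colonist", "NOUN"),
    (1, "need", "VERB"),
    (2, "supply", "NOUN"),
    (3, "the", "DET"),
    (4, "colonist", "NOUN"),
]
pairs = build_pair_dataset(tokens, same_lemma, negative_rate=0.10)
```

Fixing the seed keeps the negative sample reproducible; the 10% rate is the
one reported in the text.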

Transformer-based models, in particular BERT [9], build contextual
representations of words by stacking multi-head attention layers. Besides state-of-the-art
results obtained on a vast range of tasks in Natural Language Processing, these 
models also provide insights regarding the importance of words and the relations 
between them by looking at the attention values. Clark et al. [14] explored the 
interpretability of different attention heads from different layers of the BERT 
model. The authors show that attention heads can be used to identify specific 
syntactic functions or perform coreference resolution. 

No single attention head is accurate enough to predict these kinds of semantic
relationships between words. Therefore, a prediction model that learns to
combine the attention values from all the attention heads between two words was
trained on the dataset constructed based on TASA. By considering both
directions of the attention heads, 288 scores were used in total (which matches
144 heads, i.e., 12 layers of 12 heads as in the base BERT configuration, each
read in two directions), similar to the approach
used by Clark et al. [14]. An issue to be tackled was the limited sequence length
accepted by the pretrained BERT model (i.e., 512 tokens). Texts in the TASA
dataset, and texts in general, can be longer; thus, a sliding window was used to
compute the attention weights for all pairs of words. The sliding window had
a length of 256 tokens for efficiency reasons, but also because semantic chains
usually do not contain links that are too far apart. An overlap of 128 tokens was
used so that words on different sides of a window boundary could still be
connected; when two different attention values were computed between the same
two words (because of this overlap), the maximum value was used as the weight.
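
The windowing scheme described above can be sketched in Python as follows.
This is an illustration under stated assumptions rather than the authors'
code: the function names are hypothetical, scores are keyed by global token
positions, and the per-window scores are assumed to be already computed.

```python
def window_starts(n_tokens, window=256, overlap=128):
    """Start offsets of overlapping windows that together cover n_tokens."""
    stride = window - overlap
    starts = [0]
    while starts[-1] + window < n_tokens:
        starts.append(starts[-1] + stride)
    return starts


def merge_pair_scores(per_window_scores):
    """Merge one {key: value} dict per window, keeping the maximum per key.

    A key can be any hashable identifier of an attention value, for example
    (i, j, head) with global token positions i and j (an assumption here).
    """
    merged = {}
    for scores in per_window_scores:
        for key, value in scores.items():
            merged[key] = max(value, merged.get(key, value))
    return merged


# A 600-token text is covered by four windows of 256 tokens.
starts = window_starts(600)

# Tokens 130 and 200 fall in two overlapping windows; the larger value is kept.
merged = merge_pair_scores([
    {(130, 200): 0.2},
    {(130, 200): 0.7, (140, 300): 0.1},
])
```

With a stride of 128, any two tokens less than 128 positions apart share at
least one window, which is consistent with the remark that chain links are
rarely far apart.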

The previously described prediction model was used to score all pairs of words 
that are within a given distance in the text. The next step consisted of grouping 
these pairs of words into sets of semantically related words, i.e., semantic chains. 
The links were filtered by predicted weight using a fixed threshold,
experimentally set at 0.90. The semantic chains were then extracted as the
connected components of the resulting graph.
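
The thresholding and grouping step can be sketched with a small union-find
structure in Python. The function name and the input format are hypothetical,
and treating the threshold as inclusive (>=) is an assumption, since the text
does not say whether a score of exactly 0.90 is kept.

```python
def semantic_chains(scored_pairs, threshold=0.90):
    """Group words into chains: connected components over the kept links.

    scored_pairs: {(word_a, word_b): predicted_weight} (hypothetical format).
    Words that appear in no kept link are omitted from the result.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scored_pairs.items():
        if score >= threshold:  # inclusive comparison is an assumption
            parent[find(a)] = find(b)

    chains = {}
    for word in list(parent):
        chains.setdefault(find(word), set()).add(word)
    return sorted((sorted(c) for c in chains.values()), key=lambda c: c[0])


scored = {(0, 4): 0.95, (4, 9): 0.93, (2, 7): 0.91, (1, 2): 0.50}
chains = semantic_chains(scored)
```

Here words are identified by their token positions; tokens 0, 4 and 9 end up
in one chain because the links 0-4 and 4-9 both pass the threshold, while the
weak link 1-2 is discarded.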


3 Results 


Different architectures for identifying semantic links were trained and evaluated:
a linear model that only computes one weight for each attention head, and a
Multi-Layer Perceptron (MLP) with one or two hidden layers. All models return a
single number passed through a sigmoid activation (see Table 1).
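
The two model families compared here can be sketched in plain Python. This is
only an illustration: the function names are hypothetical, the weights are
random rather than trained, the ReLU hidden activation is an assumption (the
text only specifies the final sigmoid), and the feature count assumes 12
layers of 12 heads read in two directions.

```python
import math
import random

N_FEATURES = 12 * 12 * 2  # 288 scores, assuming 12 layers x 12 heads x 2 directions


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(features, weights, bias):
    """Linear model: one weight per attention score, then a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def mlp_score(features, layers):
    """MLP: layers is a list of (weight_rows, biases); the last layer has one output."""
    activations = features
    for k, (rows, biases) in enumerate(layers):
        z = [sum(w * x for w, x in zip(row, activations)) + b
             for row, b in zip(rows, biases)]
        is_last = k == len(layers) - 1
        # ReLU on hidden layers is an assumption, not stated in the text.
        activations = z if is_last else [max(0.0, v) for v in z]
    return sigmoid(activations[0])


def random_layer(n_in, n_out, rng):
    rows = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return rows, [0.0] * n_out


rng = random.Random(0)
features = [rng.random() for _ in range(N_FEATURES)]  # stand-in attention scores

# Hidden sizes 128 and 64, as in the last row of Table 1.
layers = [random_layer(N_FEATURES, 128, rng),
          random_layer(128, 64, rng),
          random_layer(64, 1, rng)]
p_linear = linear_score(features, [0.0] * N_FEATURES, 0.0)
p_mlp = mlp_score(features, layers)
```

Both scorers map the same 288-dimensional input to a value in (0, 1), so the
same threshold can be applied to either model's output.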


Table 1. Link prediction results. 


Model  | Hidden layer size | Accuracy (%)
Linear | -                 | 79.75
MLP    | 16                | 85.67
MLP    | 32                | 86.24
MLP    | 64                | 86.65
MLP    | 64, 64            | 87.43
MLP    | 128, 64           | 87.99


An interactive view developed using Angular 6 (https://angular.io) was
introduced to display the semantic chains; see Fig. 1 for a text selected from the
dataset described in McNamara et al. [15]. Each sentence is represented in a row,
while rows are grouped in their corresponding paragraph. Words and links from
a semantic chain share the same color. The chains extracted with our method
are visibly denser than classical lexical chains. The generated chains also
contain surprising relations that are not present in the constructed dataset.
The linear model found connections between “colonists” and “Boston”,
and between “help” and “supplies”, while the MLP model identified connections
between “British” and the compound word “Great Britain”. This example also
shows that choosing the best model between linear and MLP is not
straightforward, despite the substantial accuracy improvement of the latter on the
word pairs dataset (79.75% versus up to 87.99%). Even though the linear model
cannot perfectly learn the simple heuristics used to build the initial dataset, it
can retrieve new insightful connections between words.
Fig. 1. Visualizations of a) lexical chains [5], b) semantic chains using the linear model,
and c) semantic chains using the MLP model. 


4 Conclusions 


We introduced a novel method that identifies semantic links using only the
attention scores computed by BERT; identifying such links is a core task for
operationalizing dialogism as a discourse model. The relevant attention heads,
and the way to combine them, were learned from a dataset of word pairs built
with simple rules. The introduced visualization suggests a denser capture of
the semantic links between words, and even compound words, than is possible
with the manually defined synsets from WordNet, where such links are quite
sparse. Our aim is to further extend this model with sentiment analysis features
derived from local contexts captured by BERT, thus enriching the analysis with
the identification of convergent and divergent points of view.